ABSTRACT



A pattern recognition system and method is disclosed. The 
method includes the steps of a) providing a noisy test feature 
set of the input signal, a plurality of reference feature sets of 
reference templates produced in a quiet environment, and a 
background noise feature set of background noise present in 
the input signal, b) producing adapted reference templates
from the test feature set, the background noise feature set
and the reference feature sets, and c) determining match
scores defining the match between each of the adapted
reference templates and the test feature set. The method can
also include adapting the scores before accepting a score as
the result. The system and method are described for both
Hidden Markov Model (HMM) and Dynamic Time Warping 
(DTW) scoring units. The system performs the steps of the 
method. 

18 Claims, 3 Drawing Sheets 
PATTERN RECOGNITION SYSTEM AND utterance, and a given set of reference word templates. For

METHOD each reference template, the algorithm determines a global

similarity score between the test utterance and the reference

FIELD OF THE INVENTION template. The test utterance is identified by the reference

The present invention relates to pattern recognition generally template which yields the highest similarity score.

and to speech recognition in adverse background In a Hidden Markov Model (HMM) approach, each

noise conditions in particular. reference word is represented by a model consisting of a 

sequence of states, each characterized by a probability 

BACKGROUND OF THE INVENTION distribution. In the recognition procedure, a dynamic

programming algorithm is applied to find the best match

Prior art speech recognition systems analyze a voice between the test utterance feature vectors and the reference 

signal and compare it to stored speech patterns in order to word states. A probability that the test feature vectors 

determine what was said. When the stored speech patterns correspond to the given reference template is computed. The 

and the voice signal under analysis are acquired in different test utterance is identified as the reference word which yields 

environments, the pattern similarity is corrupted by the the greatest probability.

unmatched conditions, an effect which leads to recognition All of these methods may be extended to the recognition

errors. of connected or continuous speech by finding a sequence of

Prior art speech recognizers typically implement super- reference templates which best match the connected speech 

vised learning, or training, in order to provide stored speech test utterance in the sense that it provides a best global 

patterns. Supervised learning is performed in a "clean" similarity score. A global similarity score algorithm is

environment (e.g. one with little or no background noise). described in the article "A Model Based Connected Digit

In the training phase, a speech recognizer "learns" a Recognition System Using Either Hidden Markov Models or

reference vocabula^ by d«S?lS?^ of pXns * L " * G V ^ ™> B H 

known as templates, representing acoustical features of the ^ f^ST' * Vol. 1. Dec. 

words conforming to the vocabulary. 25 PP ' i0/-iy7 * 

A testing phase, during which words to be recognized are J^^^t ° n .° f * c f ? ned metfl " 

spoken, known as test utterances, is performed binaural " s P^7*f"> n « w »fh the global similarity 

environment which is typically noisy^uring the pha^e toe 1"? ? * * % ° ° T «* 

acoustical features of the word to be recogLed are the /^tterance contains a given reference word or words, 

extracted and compared with those of each template By 30 10 me methods o^ed above, there is a similarity 

selecting the templates) showing the maximum similarity a me f s ^ rement between feature vectors of the test utterance 

decision about the utterance being tested can be reached. feature vcctors stored * templates or models. This 

Speech is a non-stationary process and therefore, speech ^^TLl^ * <USt ° Iti ° n 

recognizers segment spoken words into time frames of T ^ f* * 1**1**1* ° f D ° 1Se ' ° r 

approximately 20to 30 ^ec. T^ese time fr*J*™tJ- » cZJ^ ^fl <f^ces in the background noise 

cally assumed to be stationary. charactensUcs of the training and testing phases. 

Th^ fl«Yn«riV a i f Mh ,^ «,«»:^ 1 u 1^ . Prior art speech recognition systems resolve the problem 

calW ^r^n^^'r^^^^^ hereinabove^ are typi- by m a » cIcan . environment and b J ^ 

2 y a E£tf o?f f ^ aD r lT bm ^ CthCI enhancement techniques to the noisy test wcVds in orterto 

n^wl^LlT™ ^ fra T' ™ C m0St 40 input to the recognition system noise reduces utterances. For 

commonly used features are the coefficients of an autore- 40 mnmi* t u t u™-., a ^, > . 

gressive model or a transformation of thenx Typical features SffiVJ^i^p^ ^ £T r l °- C0D " 

include the linear Pr^H,>ti rt » n^ffi^ * o * strained Iterative Speech Enhancement with Applications to 

^miLt *t n l % * cdiCtl0 ° Coefficients, Cepstrum Automatic Speech Recognition", published ^Proceedings 

coefficients. Bank of filter energies, etc. In general, feature m ,u* i„ fJ .JL*^„s a J ' L % 

sets reflect vocal tract characteristics. l„ nl ?<* TT*' 

Processing 1988, pp. 561-564, disclose a preprocessor

Short time spectral estimations of segments of speech can that "would produce speech or recognition features
be obtained from such sets of coefficients according to which are less sensitive to background noise so that existing

methods known in the art. recognition system may be employed".

A detailed description of different sets of features may be A similar approach is offered by Y. Ephraim et al. in "A 

found in "Digital Processing of Speech Signals" by L. R. Linear Predictive Front-End processor for Speech Recognition

Rabiner et al., Prentice Hall, Chapter 8. in Noisy Environments", Proceedings of the International

Speech Recognition systems can be classified as follows: Conference on Acoustics, Speech and Signal Processing

Isolated Word Recognition, Connected Speech Recognition 1987, pp. 1324-1327. Their system "takes into account

and Continuous Speech Recognition. Alternatively, they can the noise presence in estimating the feature vector" in order

be classified as Speaker Dependent systems which require "to make existing speech recognition systems, which have

the user to train the system, and Speaker Independent systems which utilize data bases containing proved to perform successfully in a laboratory environment,

speech of many speakers. A description of many immune to noise".

available systems can be found in "Putting Speech Recognizers It is also known to noise adapt the templates to the current

to Work", P. Wallich, IEEE Spectrum, April 1987, pp. noise level. The word templates are adapted to noise by

5 * 60 adding an estimated noise power spectrum to the template 

There are many approaches to recognizing speech. The sequence of power spectra. The power spectra are computed

Dynamic Programming approach, as described in U.S. Pat. via a Fast Fourier Transform (FFT). Such methods are

No. 4,488,243 to Brown et al., stores a feature vector for each described in the following articles:

time frame and the entirety of feature vectors are utilized as D. H. Klatt, "A Digital Filter Bank for Spectral

a time series of vectors. Through a dynamic programming Matching", ICASSP 79, pp. 573-576;

algorithm, the Dynamic Programming approach identifies Bridle et al., "A Noise Compensating Spectrum Distance

the best match between an uttered word, known as the test Measure Applied to Automatic Speech Recognition", 
Proc. Inst. Acoust. Autumn Meeting, Windermere, BRIEF DESCRIPTION OF THE DRAWINGS

Great Britain, Nov. 1984; The invention will be understood and appreciated

J. N. Holmes and N. Sedgwick, "Noise Compensation for more fully from the following detailed description taken in

Speech Recognition Using Probabilistic Models", conjunction with the drawings in which:
ICASSP 86, pp. 913-916; FIG. 1 is a block diagram illustration of a template

B. A. Mellor and A. P. Varga, "Noise Masking in a adapting pattern recognition system adapted to perform

Transform Domain", ICASSP 93, pp. II-87-II-90; Dynamic Time Warping global scoring, constructed and

Yang and R. Haavisto, "Noise Compensation for Speech operative in accordance with a preferred embodiment of the
Recognition in Car Noise Environment", ICASSP 95, present invention;

pp. 443-436; and FIG. 2 is a graphical illustration of the conditions for

L. Sanches and D. M. Brookes, "Improved Speech Recognition acceptance of a result, useful in the system of FIG. 1;

Through the Use of Noise-compensated Hidden FIG. 3 is a block diagram illustration of the hardware

Markov Models", ICSPAT, Boston, Oct. 1995, pp. elements which implement the system of FIG. 1; and

A separate article, "An Hypothesized Wiener Filter FIG. 4 is a block diagram illustration of an alternative

Approach to Noisy Speech Recognition" by A. Berstein and embodiment of the system of FIG. 1 adapted to perform

L. Shalom, ICASSP 91, pp. 913-916, describes a series of Hidden Markov Model global scoring.

Wiener filters built using the hypothesized clean template DETAILED DESCRIPTION OF PREFERRED

which are applied to the feature vectors of the noisy word. EMBODIMENTS

SUMMARY OF THE PRESENT INVENTION fa ^ ^ tQ nG , ^ ^ 

It is therefore an object of the present invention to block diagram form, a pattern recognition system constructed

A pattern recognition system and method is disclosed. The and operative in accordance with the present invention.

method includes the steps of a) providing a noisy test feature The pattern recognition system will be described in the

set of the input signal, a plurality of reference feature sets of context of a speech recognition system, it being understood

reference templates produced in a quiet environment, and a that any type of pattern can be recognized.

background noise feature set of background noise present in The speech recognition system typically comprises an

the input signal, b) producing adapted reference templates input device 8, such as a microphone or a telephone handset,

from the test feature set, the background noise feature set for acquiring a speech utterance in a necessarily quiet

and the reference feature sets, and c) determining match environment for training (i.e. reference creation) and in a

scores defining the match between each of the adapted non-necessarily quiet environment for recognition (i.e. test

reference templates and the test feature set. The method can utterance).

also include adapting the scores before accepting a score as ' additionally comprises a band pass filter 10 

the result The system and method are described for both In * me h and for eliminating from 

ffia^MarkovModel^ £^^!Si^u™to below a first frequency. 

(DTW) sconng units. The system performs the steps of the rf ^ Hz a second frequencV( 

method. ^ , , of 3200 Hz. TVpically, band pass filter 10 is also an anti- 

Additionally. in accordance with a preferred embodiment ^ ^ m ^ enablc $&mpUng of me 

of the present invention, creating the reference templates, for ^ soe ^iT uttcnDCt 

both the DTW and HMM implementations, involves raising " . . . 

Sk gain level of a referent feature set to the value of the For other types of pattern recognition ^systems, the input 

Sterfte aveSe energy of the test feature set and device 8 is any type of input device <*P^ 

the average energy of L bac^ound noise feature set and reference and test signal In such svste ^ f ^ 

ad usting the gai^raised referee feature set by the back- 45 condmorung me inpu and r^epanng n *^toW 

ground noise feature set For the HMM implementation, the 45 conversion are typically substituted for the band pass filter 

adapted reference set includes an adaptation of the current 10 - _._,..„ 

frame of the reference feature set and a "next-frame" reference The speech recognition system additionally comprises an

feature set. Analog-to-Digital Converter (ADC) 12 for sampling the

Moreover, in accordance with a preferred embodiment of analog band-passed speech utterance, typically at an 8000 Hz

the present invention, for the DTW implementation, the sampling rate, and a frame segmenter 14 for segmenting the

scores are accepted if the signal to noise ratio and score sampled speech utterance into frames of approximately 30

value are within predetermined values. msec in length.

Further, in accordance with a preferred embodiment of the An autocorrelator 16 determines the autocorrelation coefficients

present invention, the feature sets are autocorrelation feature R(i) of the frame, in accordance with standard

sets and the global scoring operation operates on autocorrelation techniques.

cepstral representations of the autocorrelation feature sets. The autocorrelation output is provided to a voice operated

For the HMM implementation, the feature sets are autocorrelation switch (VOX) 18 for identifying when no speech utterance

feature sets for the current frame and for a next is present. The datapoints of the frame which have noise in

frame. them will be provided to a background noise estimation unit

Still further, in accordance with a preferred embodiment 30 while the datapoints with speech therein will be provided

of the present invention, in the HMM implementation to a speech processing unit 31.

the step of accepting involves selecting a match score which A suitable VOX 18 is described in U.S. Pat. No. 4,959,865

is best in accordance with a predetermined criterion. to Stettiner et al. For other pattern recognition systems, VOX

Finally, in accordance with a preferred embodiment of the 18 is typically replaced by a suitable detector, typically for

present invention, the system of the present invention per- detecting the moment that the signal energy rises above a 

forms the steps of the method. background noise level. 
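
The frame segmenter 14 and autocorrelator 16 described above can be illustrated with a short sketch. It is not the patented implementation: the function names, the use of non-overlapping frames and the default order of 10 are assumptions made for illustration, while the 8000 Hz sampling rate and 30 msec frame length come from the text.

```python
def segment_frames(samples, sample_rate=8000, frame_ms=30):
    """Split samples into non-overlapping frames of frame_ms milliseconds.

    Non-overlapping framing is an assumption; the text only gives the
    approximate frame length.
    """
    frame_len = sample_rate * frame_ms // 1000  # 240 samples at 8000 Hz
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, frame_len)]


def autocorrelation(frame, order=10):
    """Return [R(0), ..., R(order)] with R(i) = sum over n of x[n] * x[n + i]."""
    n = len(frame)
    return [sum(frame[k] * frame[k + i] for k in range(n - i))
            for i in range(order + 1)]
```

Each frame's coefficient list would then play the role of one autocorrelation feature vector; R(0) is the frame energy.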
The speech and noise segments are provided to a signal strength unit 20
which determines the signal to noise ratio of the speech segment.

A switch 22, operated together with a switch 24, switches the system
between reference and test modes. In the reference mode (switch position
REF) the system learns a predetermined set of template patterns.

In the test mode (switch position TST), the system operates on a test
utterance. Because switch 24 is connected to the background noise
estimation unit 30, which is operative only during the test mode, when
switch 22 is positioned in the REF position, switch 24 is forced to be
open, as shown in FIG. 1.

It will be noted that switches 22 and 24 are for illustration only; they
depict the connections between different steps performed by a
microprocessor, described in more detail hereinbelow with respect to
FIG. 3, and are typically implemented in software.

For the reference mode, the system comprises a template creation block 26
for creating reference templates from the autocorrelation coefficients
R(i) and a reference template storage block 28 for storing the reference
templates until they are needed.

Template creation block 26 first normalizes the autocorrelation
coefficients R_r by the average speech energy <R_r(0)> of the reference
word, computed between speech endpoints, as follows:

R_r(i) = \frac{R_r(i)}{\langle R_r(0) \rangle} \qquad (1)

Template creation block 26 then creates the reference templates according
to well known techniques, such as Dynamic Time Warping (DTW), Vector
Quantisation (VQ) or Hidden Markov Model (HMM).

For DTW, reference templates are comprised of a sequence of feature
vectors of the entirety of frames forming a spoken word. For VQ, each
reference template is represented by a sequence of indices of VQ
codewords and for HMM each reference template, also known as a model, is
represented by a sequence of probability distributions. For HMM-VQ, the
HMM model is based on a VQ codebook, which is common to all templates.

When switch 22 is set to test mode (TST position), switch 24 is
automatically closed and input speech is acquired in a typically noisy
environment.

In accordance with a preferred embodiment of the present invention, for
the test mode, the system additionally comprises the background noise
estimation unit 30, a template adapter 32, an LPC and cepstrum determiner
33, a global scoring unit 34 and a decision unit 36. The background noise
estimation unit 30 estimates the spectral properties of the background
noise. The template adapter 32 adapts the reference template, denoted
R_r, to the particular additive noise present in the current test. The
LPC and cepstrum determiner 33 converts the adapted reference template,
denoted R_r', and the test feature set, denoted R_t, to the cepstral
format and determines the linear prediction (LPC) coefficients associated
therewith. The global scoring unit 34 produces a global score for the
similarity of the adapted reference template R_r' and the test utterance
R_t. The decision unit 36 adapts the global score in accordance with the
level of the signal to noise ratio produced by unit 20.

The following discussion will describe the operation of the present
invention for a global scoring unit 34 performing DTW. Afterwards, the
operation will be described for one performing HMM.

The background noise estimation unit comprises a background noise
estimator 38 for estimating noise characteristics of noise present
between words (i.e. when no speech is present) and for computing a noise
feature vector, and a noise template storage unit 40 for storing the
computed noise template for later utilization by template adapter 32.

The background noise estimator 38 is typically an averager which
produces, as the noise feature vector, the average value of an
Autocorrelation Function (ACF) of the input signal of the frames having
no speech activity. The noise feature vector is the noise template. In
particular, the background noise feature vector, denoted R_n, is
evaluated whenever there is background noise only, typically both before
and after a speech utterance is spoken.

For each noisy speech frame under test, template adapter 32 takes as
input the speech feature vector R_t of the noisy speech frame, the stored
background noise template R_n and a frame R_r of the reference template
whose similarity to the speech feature vector is to be measured.

The template adapter then adapts the normalized reference template.
Specifically, the gain level for the reference template R_r is raised to
the value of the difference of the average energy of the test utterance
and the average energy of the noise signal. In addition, the reference
template R_r is adjusted by the noise template. Mathematically, the
adapted reference template R_r' is defined as:

R_r'(i) = \left[ \langle R_t(0) \rangle - R_n(0) \right] R_r(i) + R_n(i) \qquad (2)

where <R_t(0)> is the average speech energy of the test word computed
between speech endpoints and R_n(0) is the energy of the noise, as
denoted by the zeroth autocorrelation coefficient.

The adapted reference template R_r' and the test feature vector R_t are
provided to determiner 33 which performs a linear prediction coding (LPC)
analysis according to a method described in "Digital Processing of Speech
Signals" by L. R. Rabiner and R. W. Schafer, published by Prentice Hall,
Inc., Englewood Cliffs, N.J., 1978, incorporated herein by reference.

From the LPC coefficients a_i for each feature vector, unit 33 determines
associated cepstrum coefficients C_t and C_r'. The cepstrum is defined as
follows:

[Equation (3), defining the cepstrum coefficients c_i in terms of the LPC
coefficients a_i, is not legible in the source.]

where p typically has a value of 10.

The feature vectors, denoted C*, which determiner 33 provides to the
global scoring unit 34 also include the base-2 logarithm of the energy of
the zeroth component R(0) of the autocorrelation of each signal. Thus:

C_t^{*} = [\log_2 R_t(0),\ c_t(1),\ \ldots,\ c_t(p)] \qquad (4)

C_r^{*'} = [\log_2 R_r'(0),\ c_r'(1),\ \ldots,\ c_r'(p)] \qquad (5)

The cepstrum coefficients C_t* and C_r*' are provided to the global
scoring unit 34 which produces a local similarity measure S between the
adapted reference template and the test utterance.

In the DTW approach, a warping function giving the best time alignment
between two sequences of features is searched. A global distance
accumulating the local distances over the warping function represents the
similarity between the words. A detailed explanation of the DTW algorithm
can be found in the article, incorporated herein by reference, by
H. Sakoe and S. Chiba entitled "Dynamic Programming Algorithm
Optimization for Spoken Word Recognition", IEEE Transactions on
Acoustics, Speech and Signal Processing, Vol. 26, 1978, pp. 43-49.
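
The DTW global distance described above can be sketched as a small dynamic-programming routine. This is an illustrative sketch, not the scoring unit of the patent: the Euclidean local distance, the unconstrained three-way step pattern and the names `dtw_distance` and `recognize` are assumptions, and the slope constraints discussed by Sakoe and Chiba are omitted.

```python
import math


def local_distance(a, b):
    """Euclidean distance between two feature vectors (an assumed local measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(test, ref):
    """Accumulate local distances along the best warping path between two sequences."""
    n, m = len(test), len(ref)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = local_distance(test[i - 1], ref[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


def recognize(test, templates):
    """Return (name, distance) of the template with the smallest global distance."""
    scores = {name: dtw_distance(test, ref) for name, ref in templates.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]
```

In this sketch a smaller score means a better match, so `recognize` picks the minimum; in the described system the templates passed in would be the noise-adapted cepstral sequences.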
The global similarity score S between the current reference template and
the spoken word is saved and the process repeated using the next
reference template. When the comparison of the entirety of reference
templates is completed, the reference template most similar to the spoken
word is selected as the recognized word, wherein the term "most similar"
is defined as is known for DTW algorithms. The recognized word and its
score S are provided to the decision unit 36.

The decision unit 36 rejects recognized words whose recognition score S
is too poor. Unit 36 has a noise-adapted acceptance criterion which is a
function of the score level and of the signal to noise ratio (SNR) of the
speech segment, as provided by unit 20.

FIG. 2 illustrates the acceptance criterion and is a graph of the score
value versus SNR. The graph of FIG. 2 is divided into three sections, an
acceptance area 80, a rejection area 82 and a conditional acceptance area
84. Furthermore, there are two SNR thresholds, SNR0 and SNR1, and three
score thresholds, SCORE0, SCORE1 and SCORE2.

The acceptance area 80 is bounded by the line 90 at SNR=SNR0 ending at
the point (SNR0, SCORE0), the line 92 at S=SCORE1 beginning at the point
(SNR1, SCORE1) and the line 94 connecting the two points (SNR0, SCORE0)
and (SNR1, SCORE1).

The conditional acceptance area 84 accepts a score on the condition that
the two templates with the best scores for that word are of the same
word. The lower bound of conditional acceptance area 84 is defined by the
lines 94 and 92, and the upper bound of area 84 is defined by lines 96
and 98, where line 96 is at S=SCORE2 beginning at the point (SNR1,
SCORE2) and line 98 connects the points (SNR0, SCORE0) and (SNR1,
SCORE2). Any other scores are to be rejected, indicating that no
reference template could successfully be matched to the test utterance.

It will be appreciated that the reference templates can be any type of
template. They can consist of a plurality of different words spoken by
one person, for identifying the spoken word or words, or they can consist
of average properties of utterances spoken by a plurality of people for
identifying the speaker rather than his words. In speech recognition,
each template represents a word or portion of a word in the vocabulary to
be recognized. In speaker recognition, each template represents the
identity of a person. Reference templates are described in the following
article, incorporated herein by reference: G. Doddington, "Speaker
Recognition: Identifying People by Their Voices", Proceedings of the
IEEE, Vol. 73, 1985, pp. 1651-1664.

It will be appreciated that the system of the present invention can
alternatively perform connected or continuous speech recognition. In such
a system, the global scoring unit 34 will select the best sequence of
reference templates which yields the best total similarity score.

Once a positive decision is reached (i.e. no rejection), an output device
(not shown), such as a voice actuated device, a communication channel or
a storage device, is operated in response to the meaning of the
recognized word or words contained in the speech utterance or in response
to the identity of the speaker.

Reference is now briefly made to FIG. 3 which illustrates a hardware
configuration for implementing the block diagram of FIG. 1. The system
typically comprises an input device 50 for acquiring a speech utterance
or background noise, a COder-DECoder (CODEC) 52 for implementing the band
pass filter 10 and the ADC 12, an output device 56 for operating in
response to the identified word or words, and a microprocessor 54 for
implementing the remaining elements of the block diagram of FIG. 1.

Microprocessor 54 typically works in conjunction with a Random Access
Memory (RAM) 58 and a Read Only Memory (ROM) 60, as is known in the art.
RAM 58 typically serves to implement reference template storage block 28
and noise template storage unit 40. ROM 60 is operative to store a
computer program which incorporates the method of the present invention.
Data and address buses connect the entirety of the elements of FIG. 3 in
accordance with conventional digital techniques.

Input device 50 may be, as mentioned hereinabove, a microphone or a
telephone handset. CODEC 52 may be a type TCM29c13 integrated circuit
made by Texas Instruments Inc., Houston, Tex. RAM 58 may be a type
LC3664NML 64K bit Random Access Memory manufactured by Sanyo, Tokyo,
Japan. ROM 60 may be a 128K bit Programmable Read Only Memory
manufactured by Cypress Semiconductor, San Jose, Calif. Microprocessor 54
may be a TMS320C25 digital signal microprocessor made by Texas
Instruments Inc., Houston, Tex. The output device 56 may be a dialing
mechanism, a personal computer or any other device to be activated by
known voice commands. Alternatively, it may be apparatus for
communicating the identified word or words to a communication channel or
for storing the identified word or words.

A second embodiment of the invention, which makes use of the Hidden
Markov Model global similarity approach, will now be discussed with
reference to FIG. 4. Elements of FIG. 4 which are the same as those of
FIG. 1 have similar reference numerals and therefore, will not be
described hereinbelow. A tutorial description of HMM is given in the
paper incorporated herein by reference by L. R. Rabiner, as follows: "A
Tutorial on Hidden Markov Models and Selected Applications in Speech
Recognition", Proceedings of the IEEE, Vol. 77, No. 2, Feb. 1989,
pp. 257-286.

The system of FIG. 4 utilizes an HMM template creator 100, an HMM
template adapter 102, an HMM global scoring unit 104 and an HMM decision
unit 106, respectively, rather than the template creator 26, the template
adapter 32, the global scoring unit 34 and the decision unit 36 of
FIG. 1. In addition, the system of FIG. 4 has no LPC and cepstrum
determiner 33.

The HMM template creator 100 produces the HMM reference templates from
the input autocorrelation feature sets of speech. As known in the art of
HMM, each reference word is modelled by a sequence of states and the
probability density function of the state is modelled as a mixture of
multivariate, diagonal Gaussian probabilities. Both tied and non-tied
mixtures can be used. Each mixture is characterized by a mean and
variance over the acoustic feature space (for example, of dimension 22).
The mixture parameters are estimated by a standard iterative K-means
algorithm using a Viterbi alignment and form the basis for the reference
template.

Initially, the HMM template creator 100 converts the input
autocorrelation feature set R into its cepstrum coefficients C*, where
the vector C* is defined in equation 4. In accordance with this second
preferred embodiment of the present invention, the template creator 100
determines the time derivative of the cepstrum. Let C+* indicate the
cepstrum for a "next frame". HMM template creator 100 then determines the
time derivative ΔC* as follows:

\Delta C^{*} = C^{+*} - C^{*} \qquad (6)

The HMM template creator 100 then determines the mean and variance
parameters for each mixture (in the cepstrum
plane) and the cepstral mean C_r* of a mixture is inverse-
transformed into an autocorrelation vector R_r. The template
creator 100 then adds the cepstrum time derivative ΔC* to
the cepstral mean forming, thereby, a representation of the
"next frame" cepstral mean C_r+*, as follows:

C_r^{+*} = C_r^{*} + \Delta C^{*} \qquad (7)

The HMM template creator 100 then inverse transforms
the next frame cepstral mean C_r+* to produce the "next
frame" autocorrelation R_r+. Finally, the feature set for the
reference word is defined as {R_r, R_r+}.
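
As a minimal sketch of equations 6 and 7, the two element-wise operations can be written as below. The function names are hypothetical, and the inverse transform from cepstrum back to autocorrelation that the text mentions is deliberately left out.

```python
def delta_cepstrum(current, next_frame):
    """Equation 6: time derivative as next-frame cepstrum minus current cepstrum."""
    return [n - c for c, n in zip(current, next_frame)]


def next_frame_mean(cepstral_mean, delta):
    """Equation 7: next-frame cepstral mean as current mean plus the derivative."""
    return [m + d for m, d in zip(cepstral_mean, delta)]
```

Storing the pair of current and next-frame vectors, rather than the derivative itself, lets both be noise-adapted in the autocorrelation domain before the derivative is recomputed.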

For each noisy speech frame under test, the HMM template
adapter 102 takes as input the speech feature vector R_t
of the noisy speech frame, the stored background noise
template R_n and a feature vector (R_r, R_r+) of the reference
template whose similarity to the speech feature vector is to
be measured. The template adapter 102 noise-adapts the
probabilities of the reference template. This is performed by
adapting the means of the Gaussian mixtures to account for
the additive noise. The variances are not adapted since the
effect of the noise on them is small. For the cepstral means,
the HMM template adapter 102 performs equation 2 on both
elements (of the "current" and "next" frames) of the reference
feature vector. Specifically:
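
Equation 8 does not survive in this copy. Going only by the sentence above and by the wording of claim 1 (raise the gain of the reference set to the difference between the average test energy and the noise energy, then add the noise feature set), the step might be sketched as follows. The function names and the list-based representation of autocorrelation vectors are assumptions.

```python
def adapt_template(r_ref, r_noise, test_energy):
    """Noise-adapt one normalized reference autocorrelation vector.

    gain = <R_t(0)> - R_n(0); each coefficient is scaled by the gain and the
    corresponding noise coefficient is added.
    """
    gain = test_energy - r_noise[0]
    return [gain * r + n for r, n in zip(r_ref, r_noise)]


def adapt_hmm_pair(r_current, r_next, r_noise, test_energy):
    """Apply the same adaptation to the current-frame and next-frame vectors."""
    return (adapt_template(r_current, r_noise, test_energy),
            adapt_template(r_next, r_noise, test_energy))
```

With a normalized reference (zeroth coefficient equal to 1), the adapted zeroth coefficient equals the average test energy, which is the intended effect of the gain adjustment.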

In addition, the HMM template adapter 102 adapts the
cepstral derivative of the means. Initially, the adapter 102
converts the adapted reference feature vector (R_r', R_r+') of
the reference template to their cepstral representations C_r*'
and C_r+*'. The adapter 102 then adapts the cepstral derivative
of the means as follows:

\Delta C_r^{*'} = C_r^{+*'} - C_r^{*'} \qquad (9)

The HMM template adapter 102 also produces the cepstral
representation C_t* of the test feature set R_t.

The HMM global scoring unit 104 performs the HMM 
scoring operation on the cepstral feature sets C*\ C***, C r *' 
and C r *. For each reference template i, the global scoring
unit 104 produces a separate score S_i.

Finally, the HMM decision unit 106 adapts the scores S_i
produced by the global scoring unit 104 by normalizing each
one by the average <S_j> of the other scores. Thus,

S_i' = \frac{S_i}{\langle S_j \rangle}, \qquad j \neq i \qquad (10)

The HMM decision unit 106 selects the word whose adapted
score S_i' is best or is above a predetermined threshold level.
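
The score adaptation can be illustrated with a short sketch. It assumes that "normalizing by the average of the other scores" means division, and that a larger adapted score is better; neither is stated outright in the text, so both are flagged as assumptions, as are the function names.

```python
def normalize_scores(scores):
    """Divide each score by the mean of all the other scores (assumed reading)."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to normalize")
    total = sum(scores)
    others = len(scores) - 1
    return [s / ((total - s) / others) for s in scores]


def select_best(scores, higher_is_better=True):
    """Return (index, adapted score) of the best template after normalization."""
    adapted = normalize_scores(scores)
    pick = max if higher_is_better else min
    best = pick(range(len(adapted)), key=adapted.__getitem__)
    return best, adapted[best]
```

Normalizing against the competing templates makes the decision depend on how far the winner stands out from the rest, rather than on the absolute score level, which shifts with noise.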

It will be appreciated by persons skilled in the art that the 
present invention is not limited to what has been particularly 
shown and described hereinabove. Rather the scope of the 
present invention is defined only by the claims which 
follow: 

We claim: 

1. A pattern recognition system comprising: 

means for providing a test feature set of a generally noisy 
input signal characterizing at least a portion of an input 
pattern contained within said input signal; 

means for providing a plurality of reference feature sets of 
reference templates produced in a quiet environment; 

means for providing a background noise feature set of 
background noise present in said input signal; 

a template adapter for producing adapted reference
templates from said test feature set, said background noise
feature set and said reference feature sets; and
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a global scoring unit for determining match scores defin- 
ing the match between each of said adapted reference 
templates and said test feature set. 
wherein said feature sets are autocorrelation feature sets 
5 and said template adapter includes: 

means for raising the gain level of a reference feature 
set to the value of the difference of the average 
energy of said test feature set and the average energy 
of said background noise feature set; and 
means for adjusting said gain-raised reference feature 
set by adding to it said background noise feature set 
thereby to create said adapted reference templates. 
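
The two adaptation operations recited above (gain raising followed by noise addition) can be sketched in Python as follows. This is only an illustration under stated assumptions: each feature set is taken to be a list of per-frame autocorrelation vectors whose zeroth element is the frame energy, the background noise feature set is taken to be a single averaged autocorrelation vector, and the function names are hypothetical.

```python
def average_energy(frames):
    """Average of the zeroth autocorrelation coefficient (frame energy)."""
    return sum(frame[0] for frame in frames) / len(frames)


def adapt_template(ref_frames, test_frames, noise_vec):
    """Scale a quiet reference template and add the noise autocorrelation.

    Assumption: the gain is chosen so that the average energy of the
    scaled reference equals (average test energy - noise energy).
    """
    speech_energy = average_energy(test_frames) - noise_vec[0]
    if speech_energy <= 0:
        raise ValueError("test energy does not exceed noise energy")
    gain = speech_energy / average_energy(ref_frames)
    # Autocorrelations of uncorrelated signals add, so the noise vector
    # is added lag by lag to every gain-raised reference frame.
    return [[gain * r + n for r, n in zip(frame, noise_vec)]
            for frame in ref_frames]
```

With this choice of gain, the average energy of the adapted template equals the average energy of the noisy test feature set.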

2. A pattern recognition system according to claim 1 and 
also including a signal to noise ratio determiner of the signal 
to noise ratio in the input signal and a decision unit for 
accepting at least one of said match scores if the signal to 
noise ratio and score value are within predetermined values. 
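
A decision of this kind might be sketched in Python as below. The definition of the signal to noise ratio (estimated speech energy over noise energy, in dB) and the two threshold values are assumptions made for illustration; the claim only requires that both quantities lie within predetermined values.

```python
import math


def snr_db(test_energy, noise_energy):
    """Estimate the SNR in dB from average test and noise energies.

    Assumption: speech energy is approximated as test minus noise energy.
    """
    speech_energy = test_energy - noise_energy
    if speech_energy <= 0 or noise_energy <= 0:
        return float("-inf")
    return 10.0 * math.log10(speech_energy / noise_energy)


def accept_score(score, snr, min_snr_db=5.0, min_score=0.5):
    """Accept a match score only if SNR and score pass hypothetical limits."""
    return snr >= min_snr_db and score >= min_score
```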

3. A pattern recognition system according to claim 2 
wherein said global scoring unit is a Dynamic Time Warping 
(DTW) global scoring unit. 

4. A pattern recognition system according to claim 3 
wherein said DTW global scoring unit operates on cepstral 
representations of said autocorrelation feature sets. 
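
For orientation, a textbook Dynamic Time Warping comparison of two sequences of cepstral vectors can be sketched in Python as follows. It is a generic formulation, not the specific scoring unit of the invention: the Euclidean local distance, the unconstrained step pattern and the function names are assumptions, and the result is a distance, so the lowest value is the best match.

```python
import math


def dtw_distance(test, template):
    """Accumulated DTW distance between two sequences of feature vectors."""
    n, m = len(test), len(template)
    inf = float("inf")
    # cost[i][j] holds the best accumulated distance aligning the first
    # i test frames with the first j template frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(test[i - 1], template[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # test frame repeated
                                     cost[i][j - 1],      # template frame repeated
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def best_match(test, templates):
    """Return the name of the template with the smallest DTW distance."""
    return min(templates, key=lambda name: dtw_distance(test, templates[name]))
```

Here `templates` is a hypothetical dictionary mapping each reference word to its sequence of adapted cepstral vectors.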

5. A pattern recognition system according to claim 1 
wherein said global scoring unit is a Hidden Markov Model 
(HMM) global scoring unit. 

6. A pattern recognition system according to claim 5 
wherein said feature sets are autocorrelation feature sets for 
the current frame and for another frame and wherein said 
HMM global scoring unit operates on cepstral representations 
of said feature sets for the current frame and on a 
cepstral difference between the cepstral representation of 
said feature sets for the current and another frame. 

7. A pattern recognition system according to claim 6 and 
wherein said means for providing a plurality of reference 
feature sets includes: 

means for producing cepstral representations of the auto- 
correlation feature sets of a plurality of frames of a 
plurality of reference signals and the cepstral difference 
between two frames of each reference signal; 
means for determining the cepstral mean of a mixture of 
frames; 

means for inverse-transforming the cepstral mean of said 
current frame into an autocorrelation vector for said 
current frame; 

means for adding the cepstral mean of the current frame 
to said time difference and inverse-transforming the 
result thereby to produce an autocorrelation vector for 
said another frame; and 

means for generating said autocorrelation feature set from 
said autocorrelation vector of said current and another 
frame. 

8. A pattern recognition system according to claim 7 
wherein said adapted reference templates includes adapted 
reference templates of said current and another frame and 
wherein said template adapter includes means for generating 
an adapted cepstral difference from the cepstral representa- 
tions of said adapted reference templates of said current and 
another frame. 

9. A pattern recognition system according to claim 5 and 
also including a decision unit for accepting the best one of 
said match scores in accordance with a predetermined 
criterion. 

10. A method for pattern recognition, the method com- 
prising the steps of: 

providing a test feature set of a generally noisy input 
signal characterizing at least a portion of an input 
pattern contained within said input signal; 

providing a plurality of reference feature sets of reference 
templates produced in a quiet environment; 

providing a background noise feature set of background 
noise present in said input signal; 

producing adapted reference templates from said test 
feature set, said background noise feature set and said 
reference feature sets; and 

determining match scores defining the match between 
each of said adapted reference templates and said test 
feature set, 

wherein said feature sets are autocorrelation feature sets 
and said step of producing includes the steps of: 

raising the gain level of a reference feature set to the 
value of the difference of the average energy of said 
test feature set and the average energy of said background 
noise feature set; and 

adjusting said gain-raised reference feature set by adding 
to it said background noise feature set thereby to 
create said adapted reference templates. 

11. A method according to claim 10 and also including the 
steps of determining a signal to noise ratio in the input signal 
and accepting at least one of said match scores if the signal 
to noise ratio and score value are within predetermined 
values. 

12. A method according to claim 11 wherein said step of 
determining performs Dynamic Time Warping (DTW). 

13. A method according to claim 12 wherein said step of 
determining operates on cepstral representations of said 
autocorrelation feature sets. 

14. A method according to claim 10 wherein said step of 
determining performs Hidden Markov Model (HMM) scor- 
ing. 

15. A method according to claim 14 wherein said feature 
sets are autocorrelation feature sets for the current frame and 
for another frame and wherein said step of determining 
operates on the cepstral representations of said autocorrelation 
feature sets for the current frame and on a cepstral 
difference between the cepstral representations of said feature 
sets for the current and another frame. 

16. A method according to claim 15 and wherein said step 
of providing a plurality of reference feature sets includes the 
steps of: 

producing cepstral representations of the autocorrelation 
feature sets of a plurality of frames of a plurality of 
reference signals; 

producing the cepstral difference between two frames of 
each reference signal; 

determining the cepstral mean of a mixture of frames; 

inverse-transforming the cepstral means of said current 
frame into an autocorrelation vector for said current 
frame; 

adding the cepstral means of the current frame to said time 
difference and inverse-transforming the result thereby 
to produce an autocorrelation vector for said another 
frame; and 

generating said autocorrelation feature set from said auto- 
correlation vector of said current and another frame. 

17. A method according to claim 16 wherein said adapted 
reference templates includes adapted reference templates of 
said current and another frame and wherein said step of 
producing includes the steps of raising the gain level of a 
reference feature set to the value of the difference of the 
average energy of said test feature set and the average energy 
of said background noise feature set and adjusting said 
gain-raised reference feature set by said background noise 
feature set thereby to create said adapted reference templates 
of said current and includes the step of generating an adapted 
cepstral difference from the cepstral representations of said 
adapted reference templates of said current and another 
frame. 

18. A method according to claim 17 and also including the 
step of accepting the best one of said match scores in 
accordance with a predetermined criterion. 

* * * * * 